Japanese Text Normalizer — NFKC, kana, whitespace, sentences

Pricing

Pay per usage

Normalize Japanese text for data pipelines: Unicode NFKC (full/half-width unification), wave-dash unification, whitespace cleanup, hiragana/katakana conversion, Japanese-aware sentence splitting, and per-script character stats.

Developer

Shinobu Otani

Maintained by Community

3 days ago

Last modified

Japanese Text Normalizer

Clean and normalize Japanese text for search indexes, datasets, and LLM pipelines — deterministic, instant, no LLM cost.

What it does

  • Unicode NFKC: full-width alphanumerics → ASCII (Ｃｌａｕｄｅ → Claude), half-width katakana → full-width (ｶﾞｲﾄﾞ → ガイド)
  • Wave-dash unification: ～ (U+FF5E) → 〜 (U+301C), without touching real ASCII tildes in paths/URLs
  • Whitespace cleanup: collapses space runs (including ideographic spaces), trims line ends, collapses 3+ blank lines, normalizes CRLF
  • Kana conversion: hiragana ↔ katakana (optional)
  • Sentence segmentation: Japanese-aware (。!? with closing-quote handling) plus Latin punctuation
  • Character statistics: per-script counts (hiragana / katakana / kanji / ASCII / digits) before and after
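
The normalization steps listed above can be approximated locally with the Python standard library. This is a minimal sketch, not the actor's actual implementation: the function name `normalize`, the order of the steps, and the choice to reduce runs of three or more blank lines to two are all assumptions made for illustration.

```python
import re
import unicodedata

FULLWIDTH_TILDE = "\uff5e"  # ～
WAVE_DASH = "\u301c"        # 〜


def normalize(text: str) -> str:
    """Hypothetical helper approximating the cleanup steps described above."""
    # Map the full-width tilde to a wave dash before NFKC runs;
    # NFKC on its own would fold U+FF5E into an ASCII "~".
    text = text.replace(FULLWIDTH_TILDE, WAVE_DASH)
    # NFKC: full-width alphanumerics become ASCII, half-width katakana become
    # full-width, and the ideographic space U+3000 becomes a plain space.
    text = unicodedata.normalize("NFKC", text)
    # Normalize CRLF (and bare CR) line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim trailing spaces at the end of every line.
    text = re.sub(r" +$", "", text, flags=re.MULTILINE)
    # Assumption: three or more blank lines are reduced to two.
    text = re.sub(r"\n{4,}", "\n\n\n", text)
    return text


print(normalize("Ｃｌａｕｄｅ　Ｃｏｄｅ"))  # Claude Code
print(normalize("ｶﾞｲﾄﾞ"))                # ガイド
print(normalize("10～20 ~/path"))        # 10〜20 ~/path
```

Replacing U+FF5E before calling NFKC is what keeps the wave dash distinct from ASCII tildes that were already present in paths or URLs; those are left as they are.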

Input

{
  "texts": ["Claude Codeで開発する。「すごい」と思った。"],
  "kana": "none",
  "split_sentences": true
}
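
The `kana` field presumably selects the direction of the optional hiragana/katakana conversion. Only the value "none" appears in this document, so the mode names "hiragana" and "katakana" in the sketch below are hypothetical, as is the function `convert_kana`. The sketch relies on the fact that the Unicode katakana block sits exactly 0x60 code points above the hiragana block.

```python
KANA_OFFSET = 0x60  # ぁ (U+3041) + 0x60 == ァ (U+30A1)
HIRAGANA_FIRST, HIRAGANA_LAST = 0x3041, 0x3096  # ぁ .. ゖ
KATAKANA_FIRST, KATAKANA_LAST = 0x30A1, 0x30F6  # ァ .. ヶ


def convert_kana(text: str, mode: str = "none") -> str:
    """Hypothetical helper; only the mode "none" is confirmed by the source."""
    if mode == "none":
        return text
    out = []
    for ch in text:
        cp = ord(ch)
        if mode == "katakana" and HIRAGANA_FIRST <= cp <= HIRAGANA_LAST:
            out.append(chr(cp + KANA_OFFSET))  # hiragana -> katakana
        elif mode == "hiragana" and KATAKANA_FIRST <= cp <= KATAKANA_LAST:
            out.append(chr(cp - KANA_OFFSET))  # katakana -> hiragana
        else:
            out.append(ch)  # kanji, ASCII, punctuation, and ー stay unchanged
    return "".join(out)


print(convert_kana("すごい", "katakana"))  # スゴイ
print(convert_kana("ガイド", "hiragana"))  # がいど
```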

Output (one dataset item per text)

{
  "text": "Claude Codeで開発する。「すごい」と思った。",
  "changed": true,
  "sentences": ["Claude Codeで開発する。", "「すごい」と思った。"],
  "sentence_count": 2,
  "stats_before": {"hiragana": 8, "katakana": 0, "kanji": 4, "...": "..."},
  "stats_after": {"...": "..."}
}
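
The `sentences` and `stats_*` fields can be illustrated with a small local sketch. It is an approximation under stated assumptions, not the actor's algorithm: `split_sentences` and `script_stats` are hypothetical names, the splitter handles only 。！？!? followed by closing quotes or brackets (Latin full stops are left out), and the code point ranges and the `ascii`/`digits` key names are guesses, so counts may differ from what the actor reports.

```python
TERMINATORS = set("。！？!?")
CLOSERS = set("」』）)")


def split_sentences(text: str) -> list[str]:
    """Split after sentence-ending punctuation, keeping trailing closers attached."""
    sentences, start, i = [], 0, 0
    while i < len(text):
        if text[i] in TERMINATORS:
            i += 1
            # Absorb repeated terminators and any closing quotes or brackets.
            while i < len(text) and (text[i] in TERMINATORS or text[i] in CLOSERS):
                i += 1
            sentence = text[start:i].strip()
            if sentence:
                sentences.append(sentence)
            start = i
        else:
            i += 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # text left over without a terminator
    return sentences


def script_stats(text: str) -> dict[str, int]:
    """Count characters per script; ranges and key names are assumptions."""
    stats = {"hiragana": 0, "katakana": 0, "kanji": 0, "ascii": 0, "digits": 0}
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x309F:
            stats["hiragana"] += 1
        elif 0x30A0 <= cp <= 0x30FF:
            stats["katakana"] += 1
        elif 0x4E00 <= cp <= 0x9FFF:
            stats["kanji"] += 1
        elif "0" <= ch <= "9":
            stats["digits"] += 1
        elif ch.isascii() and ch.isalpha():
            stats["ascii"] += 1
    return stats


print(split_sentences("Claude Codeで開発する。「すごい」と思った。"))
# ['Claude Codeで開発する。', '「すごい」と思った。']
```

Because closers are absorbed right after a terminator, quoted speech such as 「本当？」 stays in one piece instead of leaving a stray 」 at the start of the next sentence.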

Typical uses

  • Preprocessing scraped Japanese text before indexing or embedding
  • Unifying mixed full-width/half-width product data
  • Sentence-level dataset construction from raw Japanese prose